Using version 3 of the MN output.

Review

Subsequence:
1. check_parity: there is only one parity in house numbers (either odd or even)
2. check_direction: house number is only increasing/decreasing
3. check_street: there is only one street name
4. jump size: adjacent house numbers differ by no more than the specified jump size
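
The four subsequence rules above can be sketched roughly as below. This is a minimal illustration, not the actual sequence-generating function: the function name `is_valid_subsequence`, its signature, the default `jump_size` of 10 and the choice to allow repeated house numbers are all assumptions.

```python
from typing import List


def is_valid_subsequence(
    house_nums: List[int],
    streets: List[str],
    jump_size: int = 10,          # assumed default, not taken from the real code
    check_parity: bool = True,
    check_direction: bool = True,
    check_street: bool = True,
) -> bool:
    """Return True if the records satisfy the subsequence rules (hypothetical helper)."""
    if len(house_nums) != len(streets):
        raise ValueError("house_nums and streets must have the same length")
    if len(house_nums) < 2:
        return True

    # Rule 1: only one parity (all odd or all even).
    if check_parity and len({n % 2 for n in house_nums}) > 1:
        return False

    # Rule 3: only one street name.
    if check_street and len(set(streets)) > 1:
        return False

    diffs = [b - a for a, b in zip(house_nums, house_nums[1:])]

    # Rule 2: house numbers only go up or only go down.
    # Assumption: repeated house numbers (difference 0) are allowed.
    if check_direction:
        going_up = all(d >= 0 for d in diffs)
        going_down = all(d <= 0 for d in diffs)
        if not (going_up or going_down):
            return False

    # Rule 4: adjacent house numbers are within the jump size.
    return all(abs(d) <= jump_size for d in diffs)
```

For example, `is_valid_subsequence([301, 303, 307, 309, 311], ["1 AVE"] * 5)` passes all four rules, while `[72, 66, 70]` on a single street fails the direction rule.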

Merged sequence:
1. there is only 1 street name (only when check_street is set to TRUE)
2. there are no more than 2 house numbers whose parities are different from the rest of the house numbers in the sequence
3. adjacent house numbers do not differ by more than 10 (this threshold can be changed by setting jump_size)
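
As a rough sketch of the merged-sequence rules above (again not the real implementation): `is_valid_merged_sequence` and `max_parity_outliers` are hypothetical names, and "parity different from the rest" is read here as the size of the minority parity group.

```python
from typing import List


def is_valid_merged_sequence(
    house_nums: List[int],
    streets: List[str],
    jump_size: int = 10,
    check_street: bool = False,
    max_parity_outliers: int = 2,   # hypothetical name for the "no more than 2" rule
) -> bool:
    """Return True if the records satisfy the merged-sequence rules (hypothetical helper)."""
    # Rule 1: a single street name, enforced only when check_street is True.
    if check_street and len(set(streets)) > 1:
        return False

    # Rule 2: at most max_parity_outliers house numbers in the minority parity.
    odd = sum(n % 2 for n in house_nums)
    even = len(house_nums) - odd
    if min(odd, even) > max_parity_outliers:
        return False

    # Rule 3: adjacent house numbers differ by no more than jump_size.
    return all(abs(b - a) <= jump_size for a, b in zip(house_nums, house_nums[1:]))
```

With `check_street=False`, a run such as 1052 (W 162) followed by 1054 to 1074 (NICHOLAS AVE) counts as one merged sequence; with `check_street=True` it does not.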

Overview of results

        st_on     st_off
min     1         1
mean    26.70967  21.02628
median  6         5
max     4945      4941
count   3744      4756

Can we use sequences to suggest what NA house numbers should be?

Quick recap:
- Why are there NA house numbers: the fill down on cleaned house numbers (output of 04 clean and 05 fill down) was done using ED and best_match. NAs therefore most likely result from a dissimilar best_match in the rows above/below or, less likely, from a dissimilar ED.
- Also note that best_match itself was filled down: hence, these NA house numbers had a raw street address attached to them but no house number.
- How did sequence generation deal with NAs: in the sequence-generating function, sequences were generated using a df that had the NA rows removed. Then, to attach a SEQ to the NA rows, a fill down followed by a fill up was done (without any restriction on street).
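
The "fill down then up" step can be illustrated with a small sketch. `fill_down_then_up` is a hypothetical stand-in, with `None` playing the role of NA; it is not the function used in the pipeline.

```python
from typing import List, Optional


def fill_down_then_up(values: List[Optional[int]]) -> List[Optional[int]]:
    """Fill missing entries from the row above, then remaining ones from the row below."""
    filled = list(values)

    # Fill down: each missing entry takes the last known value above it.
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v

    # Fill up: only leading gaps are still missing; take the next known value below.
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]

    return filled
```

Because the fill down runs first and ignores street, a block of NA rows always inherits the SEQ of the sequence directly above it, whatever street those NA rows belong to.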

A look at all the sequences with NAs:

There are two types of sequences containing NAs.
Type 1: NAs occur at the end of the sequence, suggesting that the non-NA sequence before and the NA sequence after are distinct sequences. An example:

street_add  best_match  result_type  house_num  hn_1  hn_2  hn_3
E 98 ST     E 90        2            200        200   NA    NA
E 98 ST     E 90        2            200        200   NA    NA
E 98 ST     E 90        2            200        200   NA    NA
E 98 ST     E 90        2            200        200   NA    NA
E 98 ST     E 90        2            200        200   NA    NA
E 98 ST     E 90        2            200        200   NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA
E 93        E 93        1            NA         NA    NA    NA

Type 2: NAs occur between two sequences of house numbers, i.e. a block of records with no house numbers is sandwiched between 2 sequences that would otherwise have been joined. An example:

street_add        best_match  result_type  house_num  hn_1  hn_2  hn_3
KINGSBRIDGE ROAD  HAWTHORNE   4            4850       4850  NA    NA
KINGSBRIDGE ROAD  HAWTHORNE   4            4850       4850  NA    NA
KINGSBRIDGE ROAD  HAWTHORNE   4            4850       4850  NA    NA
HANTHORNE ST      HAWTHORNE   2            4850       4850  NA    NA
COOPER ST         COOPER      1            NA         NA    NA    NA
COOPER ST         COOPER      1            NA         NA    NA    NA
COOPER ST         COOPER      1            NA         NA    NA    NA
COOPER ST         COOPER      1            NA         NA    NA    NA
HAWTHORNE ST      HAWTHORNE   1            4850       4850  NA    NA
HAWTHORNE ST      HAWTHORNE   1            4850       4850  NA    NA
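
One possible way to label a sequence as Type 1 or Type 2 from its house numbers alone is sketched below. `classify_na_sequence` is a hypothetical helper, `None` stands in for NA, and returning `None` for any other pattern (no NAs, only NAs, or only leading NAs) is an assumption.

```python
from typing import List, Optional


def classify_na_sequence(house_nums: List[Optional[int]]) -> Optional[int]:
    """Return 1 for trailing NAs, 2 for NAs sandwiched between known numbers, else None."""
    known = [i for i, n in enumerate(house_nums) if n is not None]
    missing = [i for i, n in enumerate(house_nums) if n is None]
    if not known or not missing:
        return None

    first_known, last_known = known[0], known[-1]

    # Type 2: at least one NA has known house numbers both before and after it.
    if any(first_known < i < last_known for i in missing):
        return 2

    # Type 1: every NA comes after the last known house number.
    if all(i > last_known for i in missing):
        return 1

    return None
```

Applied to the two examples above, the E 98 ST / E 93 house_num column gives 1 and the HAWTHORNE / COOPER column gives 2.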

Could this be useful in further street name cleaning?

street_add best_match result_type house_num hn_1 hn_2 hn_3
BAYTES ST BAXTER 2 79 79 NA NA
BAYTES ST BAXTER 2 79 79 NA NA
BAYTES ST BAXTER 2 79 79 NA NA
BAYTES ST BAXTER 2 79 79 NA NA
BAYTES ST BAXTER 2 79 79 NA NA
BAYTES ST BAXTER 2 79 79 NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BANTAS STREET CANAL 3 NA NA NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER ST BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 79 79 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA
BAXTER STREET BAXTER 1 81 81 NA NA

type  count  freq
1     24     0.75
2     8      0.25

Probably not. In particular, some of the resulting changes would not be confident ones (i.e. if the original match was already very confident).

EDA of sequences for further street address cleaning

When creating merged sequences, we can decide whether sequences must have a single street name or whether this rule can be relaxed. We know that more sequences are generated when this rule is enforced, which suggests that, when street name is not taken into account, some sequences contain more than 1 street name. It could be useful to check whether these sequences can identify errors in the street name cleaning process.

Using result type

When determining if a street has been wrongly matched, we could use the following steps:
1. Look at sequences (generated with check_street off) with multiple street names in them
2. If any of the multiple street names are of a non-1/2 result type, use the predominant street name in the sequence instead
- this works because, when filling down, we selected the most similar string from a pool of 3 records above/below. For nonsensical street names, that approach may not be ideal.
3. Determine if the different street names are in close spatial proximity, as the enumerator may have crossed an intersection: if they are close, leave it, else change to the predominant street name (archived)
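
Steps 1 and 2 above could look something like the sketch below (step 3 is archived and left out). `suggest_predominant_street` is a hypothetical name, and taking "predominant" to mean the most frequent best_match over all rows of the sequence is an assumption.

```python
from collections import Counter
from typing import List


def suggest_predominant_street(best_matches: List[str], result_types: List[int]) -> List[str]:
    """Replace non-1/2 result type matches with the sequence's most common street name."""
    # Step 1: only sequences with multiple street names are of interest.
    if len(set(best_matches)) < 2:
        return list(best_matches)

    # Assumption: "predominant" = most frequent best_match across all rows.
    # On a tie, Counter.most_common returns the name seen first in the sequence.
    predominant, _ = Counter(best_matches).most_common(1)[0]

    # Step 2: rows with result type 1 or 2 are kept; all others get the predominant name.
    return [
        name if rtype in (1, 2) else predominant
        for name, rtype in zip(best_matches, result_types)
    ]
```

Ties are a weak point of this rule: with equal counts, the suggestion depends only on which name appears first, so tied sequences probably need a different criterion.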

Some examples:

Apel

street_add      best_match  result_type  hn_1
MOTT ELIZEBETH  MOTT        4            72
MOTT ST         MOTT        1            72
BAYARD          BAYARD      1            66
APEL            WALKER      3            66
BAYARD ST       BAYARD      1            66
BAYARD ST       BAYARD      1            70
MOTT            MOTT        1            72

East

street_add        best_match  result_type  hn_1
EAST              1 AVE       3            301
EAST              1 AVE       3            303
EAST              1 AVE       3            307
EAST              1 AVE       3            309
EAST              1 AVE       3            311
99TH STREET EAST  E 99        3            311
EAST 99TH STREET  E 99        2            311

St West

street_add       best_match    result_type  hn_1
ST WEST          W 162         5            1052
ST NICHOLAS AVE  NICHOLAS AVE  1            1054
ST NICHOLAS AVE  NICHOLAS AVE  1            1056
ST NICHOLAS AVE  NICHOLAS AVE  1            1058
ST NICHOLAS AVE  NICHOLAS AVE  1            1064
ST NICHOLAS AVE  NICHOLAS AVE  1            1066
ST NICHOLAS AVE  NICHOLAS AVE  1            1072
ST NICHOLAS AVE  NICHOLAS AVE  1            1074

How many can we clean?

Current criteria: a row is a ‘faulty’ match if its result type is > 2, the rows above and below are good matches (result type 1-2), the house number of the ‘faulty’ row differs from those of its neighbours by no more than 4, and the rows above and below have the same match where both exist (one may be missing because a faulty row can be the start of a seq).

## [1] 38
However, some errors remain:
street_add      best_match  result_type  hn_1
FIRST AVENUE    1 AVE       1            400
E 91 ST         E 91        1            404
EAST 91 STREET  E 91        3            404
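
For reference, one reading of the current criteria is sketched below. It is not the code that produced the count above: `suggest_neighbor_fixes` and `max_diff` are hypothetical names, house numbers are assumed to be non-NA, and the "differ by no more than 4" rule is assumed to compare the faulty row with each existing neighbour.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, int]  # (best_match, result_type, house number)


def suggest_neighbor_fixes(rows: List[Row], max_diff: int = 4) -> Dict[int, str]:
    """Map the index of each 'faulty' row to the match suggested by its neighbours."""
    fixes: Dict[int, str] = {}
    for i, (_, rtype, num) in enumerate(rows):
        # A 'faulty' match has result type > 2.
        if rtype <= 2:
            continue

        # The row above and/or below; one is missing at the start or end of a sequence.
        neighbors = [rows[j] for j in (i - 1, i + 1) if 0 <= j < len(rows)]
        if not neighbors:
            continue

        # Neighbours must be good matches (result type 1-2).
        if any(n_type > 2 for _, n_type, _ in neighbors):
            continue

        # House numbers must be within max_diff of the faulty row.
        if any(abs(n_num - num) > max_diff for _, _, n_num in neighbors):
            continue

        # Above and below must agree on the street where both exist.
        names = {n_match for n_match, _, _ in neighbors}
        if len(names) == 1:
            fixes[i] = names.pop()
    return fixes
```

On the Apel example, this reading suggests BAYARD for the APEL row (WALKER, result type 3), since both neighbours are BAYARD at house number 66.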

Spatial Proximity - Archived (too complex for now)

For situations where it’s possible that the multiple street names are correct:

If 30 Roosevelt is near 38 New Bowery, this could be correct and left alone

But this can identify errors. E.g. we know from manual checking that Pearl was somehow mistranscribed here. If Pearl and Madison are far apart, this process would be able to correct that:

Note: we should be quite strict about this process as we do not want our error rate to increase unnecessarily. More EDA needs to be done to check if this process is worth carrying out.

Next Steps

  1. Decide on steps we want to take + ORDER OF OPERATIONS (clean street first or clean number first?)
  2. Write functions for street name cleaning based on sequences.